ANALYSIS OF RECENT PERFORMANCE RECORDS FOR THE WHIRLWIND
COMPUTER SYSTEM
Comprehensive records of all system failures in the Whirlwind
computer and its associated terminal equipment over a 20-week
period show that the average uninterrupted operating time between
failure incidents was 10.6 hours. The average time lost for each
of the 244 incidents was 22.8 minutes. The percentage of operating
time usable was 96.5 per cent. Computer alarms accounted for
37 per cent of the stoppages but only for 12 per cent of the lost
time. Failures caused by design weaknesses required more time for
correction, on the average, than the other classes of failure analyzed.
Assuming that some major improvements in weak sections of the
system had been carried out, it was estimated that the same failures
might have averaged only 16.8 minutes of lost time per failure.

1.0 COMPUTER-PERFORMANCE RECORDS

1.1 Coverage 

Following the revisions in the Cape Cod Direction Center
facilities in July, 1954, the Whirlwind computer and its associated input
and output system entered a period in which the equipment has remained
relatively stable. In September, 1954, the procedures for gathering and
evaluating performance data on the computer system were somewhat revised.

This was done to permit more comprehensive analyses of system reliability
with particular emphasis on interrupting failures. In general, the new
procedures provide more complete data on all computer stoppages and a
biweekly review and summary of these stoppages. The records are intended
to reflect all failures in the computer and its terminal equipment that would
have caused interruptions if the Cape Cod System had been in full-scale
operation continuously. Actually, for a large fraction of the time that the
computer was in use, much of the Cape Cod terminal equipment was not required.
(This terminal equipment comprises about 40 per cent of the entire system,
which has approximately 12,700 tubes.) Under these circumstances, failures
in the terminal equipment may not have resulted in loss of computer time.
Failures which do not cause interruptions, however, must be considered in
order to obtain an accurate picture of system performance. These are
considered to be "potentially interrupting" and are given the same weight
as those that actually halted operations.

1.2 Organization of Records 

Past Whirlwind computer experience had indicated that most of
the interrupting failures could be placed into a relatively few categories
which defined either the cause of the failure or its principal symptom. In
the record system set up last September, the following categories were
selected:


     Tubes                                       (cause)
     Wiring, cabling, jacks, connectors, etc.    (cause)
     Circuit components (other than tubes)       (cause)
     Blown fuses                                 (symptom)
     Computer alarms                             (symptom)
     Design weaknesses                           (cause)
     Miscellaneous



The failures listed in the blown-fuse and computer-alarm categories are ones
for which true causes cannot be immediately determined. In general, such
failures have no associated equipment damage. Examples of incidents in the
miscellaneous category are an insulation breakdown on a phenolic panel, an
air-conditioning failure, an unseated tube or loose wire inadvertently caused
while doing essential maintenance, and a malfunction of a piece of terminal
equipment which cleared up before the fault could be found.

For each failure, the amount of time lost is that time
required to restore the system to operation after the interruption. In the
majority of the component and circuit failures, this includes the time
required to isolate and replace the defective item. In the newer sections
of the system having plug-in units, it includes only the time to locate
and replace the plug-in unit. For computer-alarm stoppages, it includes the
time required to photograph the control and indicator panels and to record
pertinent data on the program being run at that time. This information is
then studied at leisure to detect possible causes of the alarms.

The records of interrupting and potentially interrupting
failures are further broken down to show those which must be charged against
the system and those which can be attributed to new equipment installation or
revision. Because the central computer and its terminal equipment are an
integral electrical system, failures in new equipment can cause transients
which interrupt the computer, even though the new equipment is logically
independent of the rest of the system. Therefore, until a new installation
has been debugged and adequate routine-maintenance procedures have been
worked out, failures attributable to such equipment are not counted against
the system.

Memorandum 6M-3410

2.0 ANALYSIS OF PERFORMANCE DATA

Several figures are needed to adequately describe the
reliability of an electronic system. In general, system reliability is
reflected in the amount of unscheduled down time caused by interrupting
failures and in the amount of scheduled down time required for preventive
maintenance. Since the amount of down time for different types of
interrupting failures varies widely, the frequency of such failures is also
an important factor in describing system reliability. In the following
paragraphs such reliability figures for the Whirlwind computer and its
associated Cape Cod terminal equipment are given. These figures were
derived from an analysis of data gathered over the 20-week period from
28 September 1954 to 10 February 1955.

2.1 DERIVATION OF LOST-TIME AVERAGES 

It was pointed out previously that sections of the Cape Cod
terminal equipment are not involved in some of the computer applications
work, so failures in this equipment may not cause loss of computer time.
Considering this varied use of the computer, two alternatives for obtaining
representative figures of system reliability are suggested. Either (1)
the analyses are restricted to the central computer alone, or (2) all
failures (both interrupting and potentially interrupting) are counted and
lost-time data is extrapolated to give a measure of over-all system
reliability. The second method was chosen for the following reasons:

a. Accurate records had been kept of all potentially
   interrupting failures that had been detected, and the
   number of such failures was consistent with the number
   of actual lost-time incidents;

b. The central computer is not representative of some of
   the terminal equipment;

c. Since the terminal equipment is always on and can
   indirectly affect the central computer, isolation of
   failures to the central portion of the computer in
   some cases is questionable;

d. The records of time spent on preventive maintenance
   cannot be broken down among different sections of the
   system.

To determine the theoretical, or extrapolated, lost time
for each category of failures, the average lost time per lost-time failure
was calculated, and this average was multiplied by the total number of
failure incidents (interrupting and potentially interrupting) in that
category. The sum of the extrapolated figures for all categories is the
total lost-time figure desired. This figure divided by the total number
of failure incidents is the average lost time per incident for all incidents.
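
The extrapolation just described may be sketched in a few lines of Python. The incident counts and averages are those tabulated in Tables I and II of this memorandum; the function and variable names are illustrative only and are not part of the original record system.

```python
# Illustrative sketch of the lost-time extrapolation; names are hypothetical.
# Each entry: category -> (total failure incidents, average minutes lost per
# lost-time incident), as read from Tables I and II.
CATEGORIES = {
    "Computer tubes": (27, 29.8),
    "Power supply tubes": (8, 59.0),
    "Wiring, cables, etc.": (7, 36.7),
    "Components": (16, 43.6),
    "Blown fuses": (33, 14.3),
    "Alarms": (91, 7.2),
    "Design weaknesses": (16, 73.0),
    "Miscellaneous": (46, 22.5),
}


def extrapolate(categories):
    """Return (extrapolated minutes per category, total incidents, overall average)."""
    # Average per lost-time incident times ALL incidents in the category.
    per_category = {name: n * avg for name, (n, avg) in categories.items()}
    total_minutes = sum(per_category.values())
    total_incidents = sum(n for n, _ in categories.values())
    return per_category, total_incidents, total_minutes / total_incidents


per_category, incidents, average = extrapolate(CATEGORIES)
for name, minutes in per_category.items():
    print(f"{name:22s} {minutes:7.0f} min")
print(f"{incidents} incidents, {average:.1f} min per incident")
```

Run on these figures, the sketch gives 244 incidents and an average of 22.8 minutes per incident, in agreement with the totals of Table II.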

In determining the average lost time per failure for three
of the categories, a few incidents were not considered in computing the
averages because the time lost was disproportionately long. The
failure-duration distribution for the three categories (alarms, miscellaneous,
and fuses) is shown in Fig. 1. One incident in each of the first two
categories and two incidents in the third were disregarded. A study of the
records showed that three of these incidents had occurred during time
assigned to the systems engineering group and that more time was spent in a
thorough analysis of the failures than otherwise would have been required to
restore operation. The fourth incident was a major air-conditioning failure
which occurred on a week-end when service personnel were not readily
available.
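
The effect of setting aside the disproportionately long incidents can be illustrated with a short Python sketch. The totals and exclusions are the figures of Table I for the three categories concerned; the helper name is hypothetical.

```python
# Hypothetical helper: average lost time per incident after setting aside
# incidents whose lost time was disproportionately long.
def average_excluding(total_minutes, incidents, excluded_minutes=0, excluded_incidents=0):
    return (total_minutes - excluded_minutes) / (incidents - excluded_incidents)


# category -> (total minutes lost, lost-time incidents,
#              excluded minutes, excluded incidents), from Table I
rows = {
    "Blown fuses": (346, 15, 160, 2),
    "Alarms": (652, 83, 60, 1),
    "Miscellaneous": (1626, 40, 750, 1),
}

averages = {name: average_excluding(*row) for name, row in rows.items()}
for name, value in averages.items():
    print(f"{name:14s} {value:5.1f} min per incident")
```

The results, 14.3, 7.2, and 22.5 minutes, are the averages entered in the last column of Table I for these categories.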

In Table I the number of lost-time incidents and the amount
of actual lost time for each category of failures are listed in the first
two columns. The third and fourth columns show the number of incidents and
corresponding lost-time figures excluded in computing the averages given in
the last column.

                                TABLE I

                         LOST-TIME-FAILURE DATA

                                              Data excluded in
                      Number of   Total       computing averages    Average lost
Category of           lost-time   minutes     Number of   Minutes   time per
failure               incidents   lost time   incidents   lost      incident
                                                                    (minutes)

Computer Tubes            15         447                               29.8
Power Supply Tubes         7         413                               59.0
Wiring, Cables, etc.       6         220                               36.7
Components                 8         349                               43.6
Blown Fuses               15         346          2         160        14.3
Alarms                    83         652          1          60         7.2
Design Weaknesses         15        1093                               73.0
Miscellaneous             40        1626          1         750        22.5


Using the averages of Table I, extrapolated lost-time figures
were calculated to reflect all failure incidents. These figures are shown in
Table II. The totals in this table show that the average time lost for
the 244 failure incidents is 22.8 minutes.
                                TABLE II

                       EXTRAPOLATED LOST-TIME DATA

                      Number of    Total       Average lost      Total
Category of           no-lost-     number of   time per          extrapolated
failure               time         failure     incident          lost time
                      incidents    incidents   (minutes)         (minutes)
                                               (from Table I)

Computer Tubes            12          27          29.8               805
Power Supply Tubes         1           8          59.0               472
Wiring, Cables, etc.       1           7          36.7               257
Components                 8          16          43.6               697
Blown Fuses               18          33          14.3               472
Alarms                     8          91           7.2               655
Design Weaknesses          1          16          73.0              1168
Miscellaneous              6          46          22.5              1035

Totals                               244                            5561

Average lost time per incident = 5561 / 244 = 22.8 min.


2.2 Analysis of Failure Categories

The extrapolated lost-time and average lost-time figures for
the various categories of failures as given in Table II contain some
interesting points. The failures in three categories, tubes (computer types
and power-supply types combined), design weaknesses, and miscellaneous, were
responsible for 63 per cent of the time lost, while 70 per cent of the failure
incidents were in the alarm, miscellaneous, and blown-fuse categories.

The relative contributions of the various categories are better
shown by the data in Table III. Each class of failures has three quantities
listed: its percentage of the total failure incidents, its percentage of the
total lost time, and the ratio of its average lost time per incident to the
over-all average lost time per incident. Extremes in this data occur for the
alarm and the design-weakness categories. Alarms were by far the most
frequent type of failure, while design weaknesses required the most time for
correction. The computer records show that in several of the cases of design
weakness, the marginal checking or other preventive maintenance facilities
were inadequate, so incipient trouble had not been detected and signal-tracing
techniques were required to locate the fault.
                               TABLE III

                   COMPARISON OF FAILURE CATEGORIES

                      Per cent of                    Ratio of lost-time
Category of           total number    Per cent of    average for category
failure               of failure      total lost     to lost-time average
                      incidents       time           for all incidents

Computer Tubes           11.0            14.5               1.3
Power Supply Tubes        3.3             8.5               2.6
Wiring, Cables, etc.      2.9             4.6               1.6
Components                6.6            12.5               1.9
Blown Fuses              13.5             8.5               0.6
Alarms                   37.3            11.8               0.3
Design Weaknesses         6.6            21.0               3.2
Miscellaneous            18.8            18.6               1.0
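
The three quantities of Table III follow directly from the columns of Table II. A minimal Python sketch of the calculation is given below; the data are the Table II figures, and the names used are illustrative assumptions.

```python
# category -> (total failure incidents, extrapolated lost minutes,
#              average minutes lost per incident), from Table II
TABLE_II = {
    "Computer tubes": (27, 805, 29.8),
    "Power supply tubes": (8, 472, 59.0),
    "Wiring, cables, etc.": (7, 257, 36.7),
    "Components": (16, 697, 43.6),
    "Blown fuses": (33, 472, 14.3),
    "Alarms": (91, 655, 7.2),
    "Design weaknesses": (16, 1168, 73.0),
    "Miscellaneous": (46, 1035, 22.5),
}


def compare(table):
    """Return category -> (% of incidents, % of lost time, ratio to over-all average)."""
    total_incidents = sum(n for n, _, _ in table.values())
    total_minutes = sum(m for _, m, _ in table.values())
    overall_average = total_minutes / total_incidents
    return {
        name: (100 * n / total_incidents, 100 * m / total_minutes, avg / overall_average)
        for name, (n, m, avg) in table.items()
    }


comparison = compare(TABLE_II)
for name, (pct_incidents, pct_time, ratio) in comparison.items():
    print(f"{name:22s} {pct_incidents:5.1f} {pct_time:5.1f} {ratio:4.1f}")
```

The two extremes noted in the text appear clearly: alarms give 37.3 per cent of the incidents with a ratio of 0.3, while design weaknesses give 21.0 per cent of the lost time with a ratio of 3.2.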


Since tubes are known to have the highest failure rate of all
components in a computer system, an estimate of the number of stoppages
caused by tubes is of interest. For this estimate it is assumed that about
85 per cent of the alarms and blown fuses were caused by tube defects. With
this assumption, then, approximately 60 per cent of the total incidents and
40 per cent of the time lost may be attributed to tube failures.
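
The arithmetic behind this estimate may be sketched as follows, using the incident counts and extrapolated lost-time figures of Table II. The 85 per cent share is the assumption stated in the text; the variable names are illustrative.

```python
# Assumed share of alarms and blown fuses caused by tube defects (from the text).
TUBE_SHARE = 0.85

# Incidents and extrapolated lost minutes from Table II.
tube_incidents = 27 + 8              # computer tubes + power supply tubes
tube_minutes = 805 + 472
alarm_fuse_incidents = 91 + 33       # alarms + blown fuses
alarm_fuse_minutes = 655 + 472
total_incidents = 244
total_minutes = 5561

incident_share = (tube_incidents + TUBE_SHARE * alarm_fuse_incidents) / total_incidents
time_share = (tube_minutes + TUBE_SHARE * alarm_fuse_minutes) / total_minutes

print(f"incidents attributable to tubes: {100 * incident_share:.0f} per cent")
print(f"lost time attributable to tubes: {100 * time_share:.0f} per cent")
```

The sketch yields roughly 58 per cent of the incidents and 40 per cent of the lost time, in line with the rounded figures quoted above.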

Some information on component-failure rates can be derived
from historical records on the system. During the 20-week period in question,
a total of 437 tubes were replaced in the system. Replacements for accidental
damage were excluded. Since 35 of these were interrupting or potentially
interrupting failures, about 92 per cent of the failures were located during
scheduled maintenance periods. The tube-failure rate for all causes,
computed from the data already given and from the total-operating-time figure
listed in Section 2.3, is 1.49 per cent of the tube complement per 1000
hours. The rate for interrupting tube failures is 0.12 per cent of the tube
complement per 1000 hours. These tube-failure rates compare favorably with
similar data which has been derived in the past by the group working on tube
testing and evaluation.

The records on component replacement show that a total of
101 components other than tubes were replaced. Since there were 16
interrupting or potentially interrupting failures caused by such components,
about 84 per cent of the total failures were handled during scheduled
maintenance time.
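
The fractions of failures caught during scheduled maintenance follow from the replacement counts given above. A minimal Python sketch, with a hypothetical helper name:

```python
# Hypothetical helper: fraction of replaced items that were found during
# scheduled maintenance rather than as interrupting (or potentially
# interrupting) failures.
def found_in_maintenance(total_replaced, interrupting):
    return 1 - interrupting / total_replaced


tubes = found_in_maintenance(total_replaced=437, interrupting=35)
components = found_in_maintenance(total_replaced=101, interrupting=16)

print(f"tubes:      {100 * tubes:.0f} per cent")
print(f"components: {100 * components:.0f} per cent")
```

These come to 92 per cent for tubes and 84 per cent for other components, as stated in the text.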

2.3 Over-All System Performance

Considering the total computer operating time and the
amount of preventive maintenance and new installation work that was done, an
over-all picture of system performance can be obtained. Significant figures
are the following:


     Total computer operating time                     267^ hours

     Total extrapolated lost time
       (calculated from averages)                      92.7 hours

     Average uninterrupted operating
       time between incidents                          10.6 hours

     Failure incidents per 24-hour day                 2.19

     Percentage operating time usable                  96.5 per cent


The figure given above for percentage usable operating time
as calculated from the extrapolated lost time agrees closely with a figure
of 96.2 per cent, which is the actual percentage of "applications time" usable
during the 20-week period as determined from operator reports. Applications
time is the time during which the system is used by programming groups
rather than by engineering and maintenance personnel.
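
The over-all figures listed above are mutually consistent, as the following Python sketch suggests. The last digit of the total operating time is not legible in this copy, so the value of 2674 hours used here is an assumption chosen to agree with the other figures. Treating the uninterrupted time between incidents as usable hours divided by the number of incidents is likewise an assumed definition.

```python
# ASSUMPTION: the last digit of the operating-time figure is illegible in the
# source; 2674 hours is a value consistent with the other listed figures.
OPERATING_HOURS = 2674.0
LOST_MINUTES = 5561      # total extrapolated lost time, Table II
INCIDENTS = 244          # total failure incidents, Table II

lost_hours = LOST_MINUTES / 60
usable_pct = 100 * (1 - lost_hours / OPERATING_HOURS)
# ASSUMPTION: mean uninterrupted time = usable hours / number of incidents.
hours_between = (OPERATING_HOURS - lost_hours) / INCIDENTS
incidents_per_day = INCIDENTS / (OPERATING_HOURS / 24)

print(f"lost time:            {lost_hours:.1f} hours")
print(f"usable time:          {usable_pct:.1f} per cent")
print(f"time between stops:   {hours_between:.1f} hours")
print(f"incidents per day:    {incidents_per_day:.2f}")
```

Under these assumptions the sketch returns 92.7 hours lost, 96.5 per cent usable, 10.6 hours between incidents, and 2.19 incidents per 24-hour day.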


A summary of the preventive maintenance and installation work
is shown in the plots of Fig. 2. New installation and modification projects
were essentially completed by the middle of the period. The required
preventive maintenance also decreased and for about three months has remained
relatively constant at about 1.25 hours per day.


A study of the failure frequencies over the 20-week period
since September, 1954, does not show any meaningful variations. The total
failure incidents as well as the number in each category are plotted for each
two-week period in Fig. Although the total number of failures dropped
slightly during the last 8 weeks, the failure patterns for the various
categories are too inconsistent to consider the decrease as a significant
trend.


3.0 ESTIMATED PERFORMANCE OF IMPROVED SYSTEM 

A review of the system-failure records points up the fact that
a few sections of the computer have been responsible for an appreciable
fraction of the lost time. If an engineering effort to improve these sections
were justified, it seems reasonable that a significant reduction in lost time
might be realized. In order to obtain some impression of what the system
performance record might be if this work were done, each incident was reviewed
and lost-time figures were reduced for failures in those sections that might
be improved. In making the estimates it was further assumed that all failures
were repaired as rapidly as practicable, as if they had occurred during
applications time.

The data to be presented is not intended as proof that an
improvement program should be undertaken on the Whirlwind system. Rather, it
is given to permit more realistic estimates of the reliability that might be
expected in a new system design.
A summary of the estimated time lost under the conditions
described above is given in Table IV. The largest reduction in time lost
appears, as might be expected, in the design-weakness category, and some
reduction is shown in all categories. If major system improvements had
been accomplished, the number of failures in the design-weakness and
miscellaneous categories could be expected to decrease. Since this would tend
to balance any optimistic estimates for the other categories, the calculated
average of 16.8 minutes lost time per failure would seem to be reasonable.

                               TABLE IV

              ESTIMATED LOST-TIME DATA FOR IMPROVED SYSTEM

                                                Average      Total
                                                estimated    number of        Extrapolated
                      Number of    Estimated    lost time    failure          estimated
Category of           lost-time    lost time    per          incidents        lost time
failure               incidents    (minutes)    incident     (from Table II)  (minutes)

Computer Tubes
Power Supply Tubes
Wiring, Cables, etc.
Components
Blown Fuses
Alarms
Design Weaknesses
Miscellaneous
Totals

ESR/bj 

Attached: B-62051
          A-62050
          B-62049
Distribution:

MIT

J. A. Ackley
H. M. Alperin
P. H. Bagley
E. L. Best
W. J. Canty
J. E. Crane
A. R. Curtiss
H. L. Daggett
E. H. Gould
L. D. Healey
H. W. Hodgdon
L. L. Holmes
W. A. Hosier
C. Lin

T. H. Meisling
K. E. McVicar
D. A. Morrison
B. E. Morriss
L. H. Horcott
J. A. O'Brien
W. Ogden
E. E. Olsen
B. B. Paine
W. H. Papian
E. W. Pughe
A. J. Roberts
N. H. Taylor
S. Twicken
T. J. Sandy
A. V. Shorten
C. W. Watt
P. Youtz
S. L. Thompson


IBM

J. J. Belet
E. H. Goldman
G. W. Hallgren
H. E. Heath
B. Housman (Lex.)
E. A. Imm
H. D. Boss
E. C. Sampson
W. H. Thomas
L. E. Walters